Customer lifetime Value is the forecasting of net profits that connect to a specific customer during their lifetime relationship with a business. To state it more simply, CLV is the financial value of a customer’s relationship over the lifetime of the organization’s relationship with that customer.
Customer lifetime Value is important because it gives an idea of the amount of repeat business that can be expected from a specific customer. This knowledge assists the organizations in deciding how much they can profitably invest in buying a particular customer for their business.
Once how much a customer buys and the frequency of their purchases is determined, the organizations have a better understanding of how to manage their limited resources.
To understand the problem of Customer Lifetime Value and provide a data-driven solution through investigating the data and models to explain the relationship between Customer Lifetime Value (Target/Response) and the other variables (Predictors/Explanatory) along with interpreting the visuals, descriptive statistics, tests, and models in terms of the association between the explanatory variables and the response variable.
library(tidyverse)
library(plotly)
library(Hmisc)
library(ggplot2)
library(naniar)
library(ggpubr)
library(PerformanceAnalytics)
library(corrplot)
library(plyr)
library(readr)
library(caret)
library(ISLR)
library(mlbench)
library(measures)
library(earth)
library(scales)
library(lmtest)
library(matrixStats)
library(repr)
library(lubridate)
library(olsrr)
library(superml)
library(MASS)
library(phenex)
library(modelr)
library(broom)
library(magrittr)
library(rpart)
#Loading the dataset
data <- read.csv("Marketing-Customer-Value-Analysis.csv")
## Customer State Response Coverage
## Length:9134 Length:9134 Length:9134 Length:9134
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Education Effective.To.Date EmploymentStatus Gender
## Length:9134 Length:9134 Length:9134 Length:9134
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Income Location.Code Marital.Status Monthly.Premium.Auto
## Min. : 0 Length:9134 Length:9134 Min. : 61.00
## 1st Qu.: 0 Class :character Class :character 1st Qu.: 68.00
## Median :33890 Mode :character Mode :character Median : 83.00
## Mean :37657 Mean : 93.22
## 3rd Qu.:62320 3rd Qu.:109.00
## Max. :99981 Max. :298.00
## Months.Since.Last.Claim Months.Since.Policy.Inception
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 6.0 1st Qu.:24.00
## Median :14.0 Median :48.00
## Mean :15.1 Mean :48.06
## 3rd Qu.:23.0 3rd Qu.:71.00
## Max. :35.0 Max. :99.00
## Number.of.Open.Complaints Number.of.Policies Policy.Type
## Min. :0.0000 Min. :1.000 Length:9134
## 1st Qu.:0.0000 1st Qu.:1.000 Class :character
## Median :0.0000 Median :2.000 Mode :character
## Mean :0.3844 Mean :2.966
## 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :5.0000 Max. :9.000
## Policy Renew.Offer.Type Sales.Channel Total.Claim.Amount
## Length:9134 Length:9134 Length:9134 Min. : 0.099
## Class :character Class :character Class :character 1st Qu.: 272.258
## Mode :character Mode :character Mode :character Median : 383.945
## Mean : 434.089
## 3rd Qu.: 547.515
## Max. :2893.240
## Vehicle.Class Vehicle.Size Customer.Lifetime.Value
## Length:9134 Length:9134 Min. : 1898
## Class :character Class :character 1st Qu.: 3994
## Mode :character Mode :character Median : 5780
## Mean : 8005
## 3rd Qu.: 8962
## Max. :83325
## data
##
## 24 Variables 9134 Observations
## --------------------------------------------------------------------------------
## Customer
## n missing distinct
## 9134 0 9134
##
## lowest : AA10041 AA11235 AA16582 AA30683 AA34092
## highest: ZZ77357 ZZ83340 ZZ89380 ZZ91716 ZZ97035
## --------------------------------------------------------------------------------
## State
## n missing distinct
## 9134 0 5
##
## lowest : Arizona California Nevada Oregon Washington
## highest: Arizona California Nevada Oregon Washington
##
## Value Arizona California Nevada Oregon Washington
## Frequency 1703 3150 882 2601 798
## Proportion 0.186 0.345 0.097 0.285 0.087
## --------------------------------------------------------------------------------
## Response
## n missing distinct
## 9134 0 2
##
## Value No Yes
## Frequency 7826 1308
## Proportion 0.857 0.143
## --------------------------------------------------------------------------------
## Coverage
## n missing distinct
## 9134 0 3
##
## Value Basic Extended Premium
## Frequency 5568 2742 824
## Proportion 0.61 0.30 0.09
## --------------------------------------------------------------------------------
## Education
## n missing distinct
## 9134 0 5
##
## lowest : Bachelor College Doctor High School or Below Master
## highest: Bachelor College Doctor High School or Below Master
##
## Value Bachelor College Doctor
## Frequency 2748 2681 342
## Proportion 0.301 0.294 0.037
##
## Value High School or Below Master
## Frequency 2622 741
## Proportion 0.287 0.081
## --------------------------------------------------------------------------------
## Effective.To.Date
## n missing distinct
## 9134 0 59
##
## lowest : 01/01/11 01/02/11 01/03/11 01/04/11 01/05/11
## highest: 2/24/11 2/25/11 2/26/11 2/27/11 2/28/11
## --------------------------------------------------------------------------------
## EmploymentStatus
## n missing distinct
## 9134 0 5
##
## lowest : Disabled Employed Medical Leave Retired Unemployed
## highest: Disabled Employed Medical Leave Retired Unemployed
##
## Value Disabled Employed Medical Leave Retired
## Frequency 405 5698 432 282
## Proportion 0.044 0.624 0.047 0.031
##
## Value Unemployed
## Frequency 2317
## Proportion 0.254
## --------------------------------------------------------------------------------
## Gender
## n missing distinct
## 9134 0 2
##
## Value F M
## Frequency 4658 4476
## Proportion 0.51 0.49
## --------------------------------------------------------------------------------
## Income
## n missing distinct Info Mean Gmd .05 .10
## 9134 0 5694 0.984 37657 34641 0 0
## .25 .50 .75 .90 .95
## 0 33890 62320 82104 90374
##
## lowest : 0 10037 10074 10097 10105, highest: 99875 99934 99960 99961 99981
## --------------------------------------------------------------------------------
## Location.Code
## n missing distinct
## 9134 0 3
##
## Value Rural Suburban Urban
## Frequency 1773 5779 1582
## Proportion 0.194 0.633 0.173
## --------------------------------------------------------------------------------
## Marital.Status
## n missing distinct
## 9134 0 3
##
## Value Divorced Married Single
## Frequency 1369 5298 2467
## Proportion 0.15 0.58 0.27
## --------------------------------------------------------------------------------
## Monthly.Premium.Auto
## n missing distinct Info Mean Gmd .05 .10
## 9134 0 202 1 93.22 33.77 62.0 64.0
## .25 .50 .75 .90 .95
## 68.0 83.0 109.0 129.0 163.4
##
## lowest : 61 62 63 64 65, highest: 290 295 296 297 298
## --------------------------------------------------------------------------------
## Months.Since.Last.Claim
## n missing distinct Info Mean Gmd .05 .10
## 9134 0 36 0.999 15.1 11.57 1 2
## .25 .50 .75 .90 .95
## 6 14 23 30 33
##
## lowest : 0 1 2 3 4, highest: 31 32 33 34 35
## --------------------------------------------------------------------------------
## Months.Since.Policy.Inception
## n missing distinct Info Mean Gmd .05 .10
## 9134 0 100 1 48.06 32.21 4 10
## .25 .50 .75 .90 .95
## 24 48 71 87 93
##
## lowest : 0 1 2 3 4, highest: 95 96 97 98 99
## --------------------------------------------------------------------------------
## Number.of.Open.Complaints
## n missing distinct Info Mean Gmd
## 9134 0 6 0.498 0.3844 0.6588
##
## lowest : 0 1 2 3 4, highest: 1 2 3 4 5
##
## Value 0 1 2 3 4 5
## Frequency 7252 1011 374 292 149 56
## Proportion 0.794 0.111 0.041 0.032 0.016 0.006
## --------------------------------------------------------------------------------
## Number.of.Policies
## n missing distinct Info Mean Gmd
## 9134 0 9 0.936 2.966 2.439
##
## lowest : 1 2 3 4 5, highest: 5 6 7 8 9
##
## Value 1 2 3 4 5 6 7 8 9
## Frequency 3251 2294 1168 409 407 372 433 384 416
## Proportion 0.356 0.251 0.128 0.045 0.045 0.041 0.047 0.042 0.046
## --------------------------------------------------------------------------------
## Policy.Type
## n missing distinct
## 9134 0 3
##
## Value Corporate Auto Personal Auto Special Auto
## Frequency 1968 6788 378
## Proportion 0.215 0.743 0.041
## --------------------------------------------------------------------------------
## Policy
## n missing distinct
## 9134 0 9
##
## lowest : Corporate L1 Corporate L2 Corporate L3 Personal L1 Personal L2
## highest: Personal L2 Personal L3 Special L1 Special L2 Special L3
##
## Value Corporate L1 Corporate L2 Corporate L3 Personal L1 Personal L2
## Frequency 359 595 1014 1240 2122
## Proportion 0.039 0.065 0.111 0.136 0.232
##
## Value Personal L3 Special L1 Special L2 Special L3
## Frequency 3426 66 164 148
## Proportion 0.375 0.007 0.018 0.016
## --------------------------------------------------------------------------------
## Renew.Offer.Type
## n missing distinct
## 9134 0 4
##
## Value Offer1 Offer2 Offer3 Offer4
## Frequency 3752 2926 1432 1024
## Proportion 0.411 0.320 0.157 0.112
## --------------------------------------------------------------------------------
## Sales.Channel
## n missing distinct
## 9134 0 4
##
## Value Agent Branch Call Center Web
## Frequency 3477 2567 1765 1325
## Proportion 0.381 0.281 0.193 0.145
## --------------------------------------------------------------------------------
## Total.Claim.Amount
## n missing distinct Info Mean Gmd .05 .10
## 9134 0 5106 1 434.1 300 52.26 104.08
## .25 .50 .75 .90 .95
## 272.26 383.95 547.51 773.43 960.12
##
## lowest : 0.099007 0.382107 0.423310 0.517753 0.769185
## highest: 2345.413441 2452.894264 2552.343856 2759.794354 2893.239678
## --------------------------------------------------------------------------------
## Vehicle.Class
## n missing distinct
## 9134 0 6
##
## lowest : Four-Door Car Luxury Car Luxury SUV Sports Car SUV
## highest: Luxury Car Luxury SUV Sports Car SUV Two-Door Car
##
## Value Four-Door Car Luxury Car Luxury SUV Sports Car
## Frequency 4621 163 184 484
## Proportion 0.506 0.018 0.020 0.053
##
## Value SUV Two-Door Car
## Frequency 1796 1886
## Proportion 0.197 0.206
## --------------------------------------------------------------------------------
## Vehicle.Size
## n missing distinct
## 9134 0 3
##
## Value Large Medsize Small
## Frequency 946 6424 1764
## Proportion 0.104 0.703 0.193
## --------------------------------------------------------------------------------
## Customer.Lifetime.Value
## n missing distinct Info Mean Gmd .05 .10
## 9134 0 8041 1 8005 6158 2475 2662
## .25 .50 .75 .90 .95
## 3994 5780 8962 15433 22064
##
## lowest : 1898.008 1898.684 1904.001 1918.120 1940.981
## highest: 66025.754 67907.270 73225.957 74228.516 83325.381
## --------------------------------------------------------------------------------
It is concluded that there are no missing values in the given dataset.
## [1] "2011-02-24" "2011-01-31" "2011-02-19" "2011-01-20" "2011-02-03"
## [6] "2011-01-25"
Total.Premium.Since.Last.Claim - New feature is created by multiplying Month since last claim, monthly premium auto and number of policies purchased by the customers. this gives the amount that the company is supposed to get from that customer just after the last claim.
Total.Monthly.Premium - This is the product of monthly premium to number of policies purchased by the customer. This is the total amount that a particular customer is expected to pay to the company.
Most of the customers prefer buying Personal L3 policy which comes under Personal Auto policy type.
The agents have played a major role in promoting and selling insurance policies to customers. The agents strive hard to deliver the best insurance policies and services to their customers and hence are preferred over all other sales channel.
Number of policies does not have a significant impact on the total claim amount but it seems to affect the CLV values. It is seen that customers prefer buying two policies. However, they don’t go beyond buying two because in the end they can claim an amount from just one policy so there’s absolutely no need to buy more policies and pay monthly premium’s for them.
There is no gender bias. Both men and women are equally likely to buy Vehicle insurance policies. The ‘ANOVA’ test is performed to understand if there exists a significant difference between the means of the two groups (Male and Female) with respect to the CLV. Since, p-value > 0.05 we fail to reject the null hypothesis and hence there exists no significant difference between the means of the two groups (Male and Female).
## Call:
## aov(formula = Customer.Lifetime.Value ~ Gender, data = data)
##
## Terms:
## Gender Residuals
## Sum of Squares 79863451 431090855476
## Deg. of Freedom 1 9132
##
## Residual standard error: 6870.707
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 7.986e+07 79863451 1.692 0.193
## Residuals 9132 4.311e+11 47206620
85.7% of the responses recorded are negative.
Through ANOVA test, we infer that there is no significant difference in the means of the groups of vehicle size with respect to CLV. It is also observed that the luxury class vehicles have higher CLV compared to other vehicle classes.
## Call:
## aov(formula = Customer.Lifetime.Value ~ Vehicle.Size, data = data)
##
## Terms:
## Vehicle.Size Residuals
## Sum of Squares 224887989 430945830938
## Deg. of Freedom 2 9131
##
## Residual standard error: 6869.928
## Estimated effects may be unbalanced
## Df Sum Sq Mean Sq F value Pr(>F)
## Vehicle.Size 2 2.249e+08 112443995 2.382 0.0924 .
## Residuals 9131 4.309e+11 47195907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, the bank tries to resolve the complaints raised by its customers. Most of the customers have no pending complaints. We can see that the CLV value decreases as the number of pending complaints increases.
Customers who have no income are unemployed but still contribute to high customer lifetime value. This fact cannot be justified from the data provided. Hence additional data on secondary sources of income must be provided to validate the statement.
The Log Transformation is applied on CLV to transform its skewed distribution to approximately normal. This makes CLV more interpretable
The most profitable customers for the insurance company are those employed with high income. They contribute to high customer lifetime values.
Profitable customers can be segmented on the basis of months since last claim, monthly premium auto and number of policies purchased.
The response from the customers about the purchase of policy is negative in most cases. It is suggested that if the company invest in their customer service, offers provided, it will help in gaining positive response and hence acquire more customers.
All the categorical variables are converted to factors.
Even though number of open complaints and number of policies are numerical variables, they only have 6 and 9 levels respectively. So they are also converted to factors.
## 'data.frame': 9134 obs. of 24 variables:
## $ Customer : chr "BU79786" "QZ44356" "AI49188" "WW63253" ...
## $ State : Factor w/ 5 levels "Arizona","California",..: 5 1 3 2 5 4 4 1 4 4 ...
## $ Response : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ Coverage : Factor w/ 3 levels "Basic","Extended",..: 1 2 3 1 1 1 1 3 1 2 ...
## $ Education : Factor w/ 5 levels "Bachelor","College",..: 1 1 1 1 1 1 2 5 1 2 ...
## $ Effective.To.Date : Date, format: "2011-02-24" "2011-01-31" ...
## $ EmploymentStatus : Factor w/ 5 levels "Disabled","Employed",..: 2 5 2 5 2 2 2 5 3 2 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 1 1 2 2 1 1 2 2 1 ...
## $ Income : int 56274 0 48767 0 43836 62902 55350 0 14072 28812 ...
## $ Location.Code : Factor w/ 3 levels "Rural","Suburban",..: 2 2 2 2 1 1 2 3 2 3 ...
## $ Marital.Status : Factor w/ 3 levels "Divorced","Married",..: 2 3 2 2 3 2 2 3 1 2 ...
## $ Monthly.Premium.Auto : int 69 94 108 106 73 69 67 101 71 93 ...
## $ Months.Since.Last.Claim : int 32 13 18 18 12 14 0 0 13 17 ...
## $ Months.Since.Policy.Inception: int 5 42 38 65 44 94 13 68 3 7 ...
## $ Number.of.Open.Complaints : Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Number.of.Policies : Factor w/ 9 levels "1","2","3","4",..: 1 8 2 7 1 2 9 4 2 8 ...
## $ Policy.Type : Factor w/ 3 levels "Corporate Auto",..: 1 2 2 1 2 2 1 1 1 3 ...
## $ Policy : Factor w/ 9 levels "Corporate L1",..: 3 6 6 2 4 6 3 3 3 8 ...
## $ Renew.Offer.Type : Factor w/ 4 levels "Offer1","Offer2",..: 1 3 1 1 1 2 1 1 1 2 ...
## $ Sales.Channel : Factor w/ 4 levels "Agent","Branch",..: 1 1 1 3 1 4 1 1 1 2 ...
## $ Total.Claim.Amount : num 385 1131 566 530 138 ...
## $ Vehicle.Class : Factor w/ 6 levels "Four-Door Car",..: 6 1 6 5 1 6 1 1 1 1 ...
## $ Vehicle.Size : Factor w/ 3 levels "Large","Medsize",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Customer.Lifetime.Value : num 2764 6980 12887 7646 2814 ...
## State Response Coverage Education EmploymentStatus Gender Income
## 1 Washington No Basic Bachelor Employed F 56274
## 2 Arizona No Extended Bachelor Unemployed F 0
## 3 Nevada No Premium Bachelor Employed F 48767
## 4 California No Basic Bachelor Unemployed M 0
## 5 Washington No Basic Bachelor Employed M 43836
## 6 Oregon Yes Basic Bachelor Employed F 62902
## Location.Code Marital.Status Monthly.Premium.Auto Months.Since.Last.Claim
## 1 Suburban Married 69 32
## 2 Suburban Single 94 13
## 3 Suburban Married 108 18
## 4 Suburban Married 106 18
## 5 Rural Single 73 12
## 6 Rural Married 69 14
## Months.Since.Policy.Inception Number.of.Open.Complaints Number.of.Policies
## 1 5 0 1
## 2 42 0 8
## 3 38 0 2
## 4 65 0 7
## 5 44 0 1
## 6 94 0 2
## Policy.Type Policy Renew.Offer.Type Sales.Channel Total.Claim.Amount
## 1 Corporate Auto Corporate L3 Offer1 Agent 384.8111
## 2 Personal Auto Personal L3 Offer3 Agent 1131.4649
## 3 Personal Auto Personal L3 Offer1 Agent 566.4722
## 4 Corporate Auto Corporate L2 Offer1 Call Center 529.8813
## 5 Personal Auto Personal L1 Offer1 Agent 138.1309
## 6 Personal Auto Personal L3 Offer2 Web 159.3830
## Vehicle.Class Vehicle.Size Customer.Lifetime.Value
## 1 Two-Door Car Medsize 2763.519
## 2 Four-Door Car Medsize 6979.536
## 3 Two-Door Car Medsize 12887.432
## 4 SUV Medsize 7645.862
## 5 Four-Door Car Medsize 2813.693
## 6 Two-Door Car Medsize 8256.298
The ‘Customer’ variable is removed since it is just ids. For ‘Effective.To.Date’, it only has 2 months, Jan and Feb in the year 2011. So this is also removed.
We tried binning three variables Total Claim Amount, Monthly Premium Auto and Income. But, Binning did not help us improve the model performance. Hence, we did not use it in our final model.
We capped Total Claim Amount, Monthly Premium Auto and Income at 0.97 (97th Quartile). Capping too did not help us improve the model performance. Hence, we did not use it in our final model.
## [1] 6850 22
## [1] 2284 22
The data is split into train and test with a ratio 3:1
Scaling does not significantly impact the regression model. Hence, we have tried but not incorporated it in our modeling.
##
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58674 -0.05682 -0.00397 0.03957 0.92721
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.335e+00 3.493e-02 209.984 < 2e-16 ***
## StateCalifornia 8.402e-03 7.244e-03 1.160 0.246111
## StateNevada 1.258e-02 1.006e-02 1.250 0.211299
## StateOregon 1.577e-02 7.522e-03 2.096 0.036110 *
## StateWashington 6.825e-03 1.041e-02 0.656 0.511927
## ResponseYes -2.530e-04 8.018e-03 -0.032 0.974829
## CoverageExtended 7.471e-02 9.585e-03 7.794 7.46e-15 ***
## CoveragePremium 9.782e-02 2.011e-02 4.865 1.17e-06 ***
## EducationCollege 4.064e-04 6.589e-03 0.062 0.950824
## EducationDoctor 3.084e-02 1.404e-02 2.197 0.028044 *
## EducationHigh School or Below 1.922e-02 6.685e-03 2.875 0.004049 **
## EducationMaster 2.624e-02 1.002e-02 2.618 0.008860 **
## EmploymentStatusEmployed 5.344e-02 1.386e-02 3.857 0.000116 ***
## EmploymentStatusMedical Leave 3.303e-02 1.686e-02 1.959 0.050170 .
## EmploymentStatusRetired 7.559e-03 1.924e-02 0.393 0.694422
## EmploymentStatusUnemployed -1.470e-02 1.395e-02 -1.054 0.291920
## GenderM -2.445e-02 5.119e-03 -4.776 1.83e-06 ***
## Income 3.011e-07 1.487e-07 2.024 0.042981 *
## Location.CodeSuburban -4.994e-03 1.019e-02 -0.490 0.624146
## Location.CodeUrban 1.787e-03 9.365e-03 0.191 0.848654
## Marital.StatusMarried 1.767e-02 7.494e-03 2.358 0.018413 *
## Marital.StatusSingle -1.743e-02 8.725e-03 -1.998 0.045745 *
## Monthly.Premium.Auto 7.113e-03 3.854e-04 18.458 < 2e-16 ***
## Months.Since.Last.Claim 4.622e-04 2.513e-04 1.839 0.065952 .
## Months.Since.Policy.Inception 7.291e-06 9.157e-05 0.080 0.936538
## Number.of.Open.Complaints1 -1.739e-02 8.274e-03 -2.102 0.035578 *
## Number.of.Open.Complaints2 -2.885e-03 1.283e-02 -0.225 0.822148
## Number.of.Open.Complaints3 -5.925e-02 1.444e-02 -4.103 4.13e-05 ***
## Number.of.Open.Complaints4 -9.907e-02 2.049e-02 -4.836 1.35e-06 ***
## Number.of.Open.Complaints5 -1.287e-01 3.292e-02 -3.910 9.32e-05 ***
## Number.of.Policies2 1.402e+00 6.648e-03 210.811 < 2e-16 ***
## Number.of.Policies3 6.974e-01 8.341e-03 83.609 < 2e-16 ***
## Number.of.Policies4 6.983e-01 1.261e-02 55.355 < 2e-16 ***
## Number.of.Policies5 6.949e-01 1.267e-02 54.850 < 2e-16 ***
## Number.of.Policies6 6.950e-01 1.351e-02 51.438 < 2e-16 ***
## Number.of.Policies7 6.918e-01 1.246e-02 55.542 < 2e-16 ***
## Number.of.Policies8 6.992e-01 1.291e-02 54.161 < 2e-16 ***
## Number.of.Policies9 7.022e-01 1.270e-02 55.272 < 2e-16 ***
## Policy.TypePersonal Auto -8.892e-03 1.340e-02 -0.663 0.507106
## Policy.TypeSpecial Auto 2.053e-02 2.317e-02 0.886 0.375717
## PolicyCorporate L2 -1.935e-02 1.610e-02 -1.202 0.229339
## PolicyCorporate L3 -7.101e-03 1.480e-02 -0.480 0.631421
## PolicyPersonal L1 -3.619e-03 7.946e-03 -0.455 0.648811
## PolicyPersonal L2 2.324e-03 6.760e-03 0.344 0.731063
## PolicyPersonal L3 NA NA NA NA
## PolicySpecial L1 -5.987e-02 3.474e-02 -1.723 0.084870 .
## PolicySpecial L2 1.811e-03 2.707e-02 0.067 0.946658
## PolicySpecial L3 NA NA NA NA
## Renew.Offer.TypeOffer2 7.872e-03 6.435e-03 1.223 0.221237
## Renew.Offer.TypeOffer3 9.135e-03 7.693e-03 1.187 0.235099
## Renew.Offer.TypeOffer4 1.378e-03 9.005e-03 0.153 0.878393
## Sales.ChannelBranch 9.574e-03 6.331e-03 1.512 0.130539
## Sales.ChannelCall Center -2.605e-05 7.161e-03 -0.004 0.997098
## Sales.ChannelWeb -3.199e-03 7.931e-03 -0.403 0.686687
## Total.Claim.Amount 9.080e-06 1.839e-05 0.494 0.621585
## Vehicle.ClassLuxury Car 2.092e-02 5.331e-02 0.393 0.694685
## Vehicle.ClassLuxury SUV 5.819e-02 5.292e-02 1.100 0.271555
## Vehicle.ClassSports Car 1.513e-01 2.039e-02 7.420 1.32e-13 ***
## Vehicle.ClassSUV 1.436e-01 1.763e-02 8.144 4.52e-16 ***
## Vehicle.ClassTwo-Door Car 6.173e-04 6.637e-03 0.093 0.925907
## Vehicle.SizeMedsize -5.723e-04 8.404e-03 -0.068 0.945710
## Vehicle.SizeSmall 2.768e-03 9.794e-03 0.283 0.777497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2088 on 6790 degrees of freedom
## Multiple R-squared: 0.8986, Adjusted R-squared: 0.8977
## F-statistic: 1019 on 59 and 6790 DF, p-value: < 2.2e-16
The above linear model contains all the features.
When we plot the density plot for ‘Customer.Lifetime.Value’, its visible that the graph is skewed towards the left.To tackle this we take log(Customer.Lifetime.Value).
The log transformation reduces or removes the skewness of our original data. Log transformation also de-emphasizes outliers and allows us to potentially obtain a bell-shaped distribution. The idea is that taking the log of the data can restore symmetry to the data.
The adj R-squared value after using log is found to be 0.8977
## [1] "RMSE 3937.60929439273"
## [1] "Adj R2 0.670865413869209"
## [1] "RMSE 4045.75854712155"
## [1] "Adj R2 0.648896880168791"
To predict the RMSE and adj R squared for train and test together we created a function eval_fns. Since log was taken for Customer.Lifetime.Value, exponential of the predict function is taken. Adj R-squared for train is 0.670 which means that 67% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model. Adj R-squared for test is 0.649 which means that 64.9% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model.
The features are selected based on the importance score and the significant p-values obtained from model_1. We use the method rpart to perform feature selection.Rpart means splitting the dataset recursively, which means that the subsets that arise from a split are further split until a predetermined termination criterion is reached.
##
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ +Total.Claim.Amount +
## Coverage + Education + EmploymentStatus + Marital.Status +
## Monthly.Premium.Auto + Number.of.Open.Complaints + Number.of.Policies +
## Policy + Renew.Offer.Type + Vehicle.Class, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56716 -0.05499 -0.00481 0.03973 0.91359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.340e+00 3.186e-02 230.364 < 2e-16 ***
## Total.Claim.Amount -6.295e-06 1.308e-05 -0.481 0.63029
## CoverageExtended 7.297e-02 9.584e-03 7.613 3.04e-14 ***
## CoveragePremium 9.336e-02 2.012e-02 4.641 3.53e-06 ***
## EducationCollege 8.248e-04 6.589e-03 0.125 0.90038
## EducationDoctor 2.981e-02 1.404e-02 2.124 0.03374 *
## EducationHigh School or Below 1.991e-02 6.677e-03 2.982 0.00287 **
## EducationMaster 2.721e-02 1.001e-02 2.719 0.00657 **
## EmploymentStatusEmployed 6.193e-02 1.286e-02 4.817 1.49e-06 ***
## EmploymentStatusMedical Leave 3.044e-02 1.686e-02 1.806 0.07100 .
## EmploymentStatusRetired 1.580e-03 1.880e-02 0.084 0.93301
## EmploymentStatusUnemployed -2.253e-02 1.363e-02 -1.653 0.09831 .
## Marital.StatusMarried 1.871e-02 7.470e-03 2.505 0.01228 *
## Marital.StatusSingle -1.731e-02 8.695e-03 -1.991 0.04653 *
## Monthly.Premium.Auto 7.284e-03 3.809e-04 19.123 < 2e-16 ***
## Number.of.Open.Complaints1 -1.744e-02 8.274e-03 -2.107 0.03513 *
## Number.of.Open.Complaints2 -3.557e-03 1.283e-02 -0.277 0.78155
## Number.of.Open.Complaints3 -5.950e-02 1.444e-02 -4.122 3.81e-05 ***
## Number.of.Open.Complaints4 -9.758e-02 2.050e-02 -4.760 1.98e-06 ***
## Number.of.Open.Complaints5 -1.306e-01 3.294e-02 -3.964 7.46e-05 ***
## Number.of.Policies2 1.401e+00 6.650e-03 210.719 < 2e-16 ***
## Number.of.Policies3 6.976e-01 8.334e-03 83.696 < 2e-16 ***
## Number.of.Policies4 6.985e-01 1.261e-02 55.379 < 2e-16 ***
## Number.of.Policies5 6.925e-01 1.266e-02 54.717 < 2e-16 ***
## Number.of.Policies6 6.967e-01 1.349e-02 51.646 < 2e-16 ***
## Number.of.Policies7 6.919e-01 1.245e-02 55.549 < 2e-16 ***
## Number.of.Policies8 6.991e-01 1.291e-02 54.147 < 2e-16 ***
## Number.of.Policies9 6.985e-01 1.270e-02 55.002 < 2e-16 ***
## PolicyCorporate L2 -1.848e-02 1.612e-02 -1.147 0.25161
## PolicyCorporate L3 -6.032e-03 1.481e-02 -0.407 0.68387
## PolicyPersonal L1 -1.142e-02 1.445e-02 -0.790 0.42950
## PolicyPersonal L2 -6.086e-03 1.383e-02 -0.440 0.65997
## PolicyPersonal L3 -7.974e-03 1.341e-02 -0.595 0.55200
## PolicySpecial L1 -3.989e-02 3.151e-02 -1.266 0.20558
## PolicySpecial L2 2.579e-02 2.282e-02 1.130 0.25857
## PolicySpecial L3 2.073e-02 2.320e-02 0.893 0.37164
## Renew.Offer.TypeOffer2 7.375e-03 6.283e-03 1.174 0.24052
## Renew.Offer.TypeOffer3 8.304e-03 7.610e-03 1.091 0.27520
## Renew.Offer.TypeOffer4 -1.070e-03 8.829e-03 -0.121 0.90356
## Vehicle.ClassLuxury Car 8.885e-03 5.333e-02 0.167 0.86768
## Vehicle.ClassLuxury SUV 4.446e-02 5.290e-02 0.840 0.40071
## Vehicle.ClassSports Car 1.453e-01 2.038e-02 7.129 1.11e-12 ***
## Vehicle.ClassSUV 1.389e-01 1.763e-02 7.878 3.83e-15 ***
## Vehicle.ClassTwo-Door Car 1.107e-03 6.636e-03 0.167 0.86747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2092 on 6806 degrees of freedom
## Multiple R-squared: 0.898, Adjusted R-squared: 0.8973
## F-statistic: 1393 on 43 and 6806 DF, p-value: < 2.2e-16
This model includes features from the importance score and p-values. here Adj R2 is found to 0.8973.
## [1] "RMSE 3945.04179126583"
## [1] "Adj R2 0.669621714777904"
## [1] "RMSE 4043.91530696999"
## [1] "Adj R2 0.64921673116145"
Adj R-squared for train is 0.669 which means that 66.9% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model. Adj R-squared for test is 0.649 which means that 64.9% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model.
##
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ +poly(Total.Claim.Amount,
## 3, raw = TRUE) + Coverage + Education + EmploymentStatus +
## Marital.Status + poly(Monthly.Premium.Auto, 3, raw = TRUE) +
## Number.of.Open.Complaints + Number.of.Policies + Policy +
## Renew.Offer.Type + Vehicle.Class, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42721 -0.02933 -0.00126 0.02021 0.86818
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 6.461e+00 6.515e-02 99.167
## poly(Total.Claim.Amount, 3, raw = TRUE)1 1.298e-05 4.134e-05 0.314
## poly(Total.Claim.Amount, 3, raw = TRUE)2 -2.065e-08 5.999e-08 -0.344
## poly(Total.Claim.Amount, 3, raw = TRUE)3 4.660e-12 2.194e-11 0.212
## CoverageExtended -1.840e-02 1.060e-02 -1.736
## CoveragePremium -4.328e-02 2.127e-02 -2.035
## EducationCollege -1.917e-03 6.430e-03 -0.298
## EducationDoctor 3.027e-02 1.370e-02 2.209
## EducationHigh School or Below 1.572e-02 6.520e-03 2.412
## EducationMaster 2.659e-02 9.768e-03 2.722
## EmploymentStatusEmployed 5.722e-02 1.257e-02 4.553
## EmploymentStatusMedical Leave 2.156e-02 1.646e-02 1.310
## EmploymentStatusRetired 3.524e-03 1.839e-02 0.192
## EmploymentStatusUnemployed -2.728e-02 1.332e-02 -2.049
## Marital.StatusMarried 1.634e-02 7.292e-03 2.241
## Marital.StatusSingle -1.863e-02 8.490e-03 -2.194
## poly(Monthly.Premium.Auto, 3, raw = TRUE)1 2.623e-02 1.457e-03 17.997
## poly(Monthly.Premium.Auto, 3, raw = TRUE)2 -1.000e-04 1.005e-05 -9.948
## poly(Monthly.Premium.Auto, 3, raw = TRUE)3 1.524e-07 2.137e-08 7.133
## Number.of.Open.Complaints1 -1.551e-02 8.075e-03 -1.920
## Number.of.Open.Complaints2 -6.288e-03 1.252e-02 -0.502
## Number.of.Open.Complaints3 -5.942e-02 1.409e-02 -4.219
## Number.of.Open.Complaints4 -9.348e-02 2.001e-02 -4.672
## Number.of.Open.Complaints5 -1.234e-01 3.214e-02 -3.839
## Number.of.Policies2 1.399e+00 6.489e-03 215.616
## Number.of.Policies3 6.946e-01 8.134e-03 85.396
## Number.of.Policies4 6.941e-01 1.231e-02 56.395
## Number.of.Policies5 6.905e-01 1.235e-02 55.898
## Number.of.Policies6 6.939e-01 1.316e-02 52.716
## Number.of.Policies7 6.916e-01 1.215e-02 56.915
## Number.of.Policies8 6.953e-01 1.260e-02 55.174
## Number.of.Policies9 6.931e-01 1.240e-02 55.919
## PolicyCorporate L2 -2.302e-02 1.574e-02 -1.463
## PolicyCorporate L3 -5.855e-03 1.447e-02 -0.405
## PolicyPersonal L1 -1.238e-02 1.413e-02 -0.876
## PolicyPersonal L2 -9.157e-03 1.352e-02 -0.677
## PolicyPersonal L3 -9.434e-03 1.310e-02 -0.720
## PolicySpecial L1 -4.344e-02 3.078e-02 -1.412
## PolicySpecial L2 1.870e-02 2.229e-02 0.839
## PolicySpecial L3 2.154e-02 2.265e-02 0.951
## Renew.Offer.TypeOffer2 1.008e-02 6.133e-03 1.644
## Renew.Offer.TypeOffer3 1.161e-02 7.429e-03 1.564
## Renew.Offer.TypeOffer4 3.609e-03 8.621e-03 0.419
## Vehicle.ClassLuxury Car -3.369e-02 5.403e-02 -0.624
## Vehicle.ClassLuxury SUV 4.669e-03 5.390e-02 0.087
## Vehicle.ClassSports Car -1.085e-02 2.216e-02 -0.489
## Vehicle.ClassSUV -2.117e-02 1.975e-02 -1.072
## Vehicle.ClassTwo-Door Car 2.595e-03 6.476e-03 0.401
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## poly(Total.Claim.Amount, 3, raw = TRUE)1 0.753465
## poly(Total.Claim.Amount, 3, raw = TRUE)2 0.730712
## poly(Total.Claim.Amount, 3, raw = TRUE)3 0.831790
## CoverageExtended 0.082674 .
## CoveragePremium 0.041857 *
## EducationCollege 0.765604
## EducationDoctor 0.027186 *
## EducationHigh School or Below 0.015907 *
## EducationMaster 0.006503 **
## EmploymentStatusEmployed 5.38e-06 ***
## EmploymentStatusMedical Leave 0.190244
## EmploymentStatusRetired 0.848025
## EmploymentStatusUnemployed 0.040520 *
## Marital.StatusMarried 0.025040 *
## Marital.StatusSingle 0.028246 *
## poly(Monthly.Premium.Auto, 3, raw = TRUE)1 < 2e-16 ***
## poly(Monthly.Premium.Auto, 3, raw = TRUE)2 < 2e-16 ***
## poly(Monthly.Premium.Auto, 3, raw = TRUE)3 1.09e-12 ***
## Number.of.Open.Complaints1 0.054844 .
## Number.of.Open.Complaints2 0.615404
## Number.of.Open.Complaints3 2.49e-05 ***
## Number.of.Open.Complaints4 3.04e-06 ***
## Number.of.Open.Complaints5 0.000125 ***
## Number.of.Policies2 < 2e-16 ***
## Number.of.Policies3 < 2e-16 ***
## Number.of.Policies4 < 2e-16 ***
## Number.of.Policies5 < 2e-16 ***
## Number.of.Policies6 < 2e-16 ***
## Number.of.Policies7 < 2e-16 ***
## Number.of.Policies8 < 2e-16 ***
## Number.of.Policies9 < 2e-16 ***
## PolicyCorporate L2 0.143547
## PolicyCorporate L3 0.685756
## PolicyPersonal L1 0.380833
## PolicyPersonal L2 0.498214
## PolicyPersonal L3 0.471393
## PolicySpecial L1 0.158131
## PolicySpecial L2 0.401420
## PolicySpecial L3 0.341567
## Renew.Offer.TypeOffer2 0.100295
## Renew.Offer.TypeOffer3 0.117980
## Renew.Offer.TypeOffer4 0.675501
## Vehicle.ClassLuxury Car 0.532923
## Vehicle.ClassLuxury SUV 0.930983
## Vehicle.ClassSports Car 0.624506
## Vehicle.ClassSUV 0.283957
## Vehicle.ClassTwo-Door Car 0.688656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2041 on 6802 degrees of freedom
## Multiple R-squared: 0.9029, Adjusted R-squared: 0.9023
## F-statistic: 1346 on 47 and 6802 DF, p-value: < 2.2e-16
Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y|x). Basically it adds the quadratic or polynomial terms to the regression.
We have chosen polynomial regression as our 3rd model as it provides the best approximation of the relationship between the dependent and independent variables. Also the polynomial regression fits a wide range of curvature.
Adj R-squared value is found out to be 0.9023
## [1] "RMSE 3866.95258117393"
## [1] "Adj R2 0.682571460172236"
## [1] "RMSE 3947.30847379713"
## [1] "Adj R2 0.665776562392871"
Adj R-squared for train is 0.683 which means that 68.3% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model. Adj R-squared for test is 0.669 which means that 66.9% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model.
##
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ Number.of.Policies +
## Monthly.Premium.Auto + Total.Claim.Amount + Vehicle.Class +
## State + Coverage + Education + Gender + EmploymentStatus +
## Marital.Status + Months.Since.Last.Claim + Number.of.Open.Complaints +
## Policy.Type + Renew.Offer.Type + Sales.Channel + Vehicle.Size +
## I(Monthly.Premium.Auto^2) + I(Total.Claim.Amount^2) + I(Months.Since.Last.Claim^2),
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47117 -0.03463 -0.00363 0.02381 0.87413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.793e+00 4.384e-02 154.965 < 2e-16 ***
## Number.of.Policies2 1.399e+00 6.504e-03 215.167 < 2e-16 ***
## Number.of.Policies3 6.948e-01 8.158e-03 85.161 < 2e-16 ***
## Number.of.Policies4 6.948e-01 1.233e-02 56.355 < 2e-16 ***
## Number.of.Policies5 6.917e-01 1.240e-02 55.765 < 2e-16 ***
## Number.of.Policies6 6.944e-01 1.320e-02 52.595 < 2e-16 ***
## Number.of.Policies7 6.906e-01 1.218e-02 56.691 < 2e-16 ***
## Number.of.Policies8 6.957e-01 1.264e-02 55.049 < 2e-16 ***
## Number.of.Policies9 6.960e-01 1.243e-02 56.011 < 2e-16 ***
## Monthly.Premium.Auto 1.699e-02 6.856e-04 24.782 < 2e-16 ***
## Total.Claim.Amount 2.205e-05 2.351e-05 0.938 0.348269
## Vehicle.ClassLuxury Car -1.093e-01 5.274e-02 -2.072 0.038310 *
## Vehicle.ClassLuxury SUV -7.798e-02 5.238e-02 -1.489 0.136592
## Vehicle.ClassSports Car -1.624e-02 2.215e-02 -0.733 0.463526
## Vehicle.ClassSUV -2.413e-02 1.977e-02 -1.220 0.222329
## Vehicle.ClassTwo-Door Car 1.967e-03 6.494e-03 0.303 0.761978
## StateCalifornia 9.408e-03 7.076e-03 1.330 0.183713
## StateNevada 1.113e-02 9.841e-03 1.130 0.258336
## StateOregon 1.721e-02 7.357e-03 2.339 0.019374 *
## StateWashington 1.054e-02 1.018e-02 1.035 0.300525
## CoverageExtended -1.100e-02 1.057e-02 -1.040 0.298169
## CoveragePremium -4.427e-02 2.129e-02 -2.079 0.037634 *
## EducationCollege -2.539e-03 6.444e-03 -0.394 0.693641
## EducationDoctor 3.191e-02 1.372e-02 2.325 0.020102 *
## EducationHigh School or Below 1.572e-02 6.538e-03 2.405 0.016204 *
## EducationMaster 2.419e-02 9.790e-03 2.471 0.013486 *
## GenderM -2.677e-02 5.003e-03 -5.352 9.01e-08 ***
## EmploymentStatusEmployed 6.419e-02 1.259e-02 5.099 3.51e-07 ***
## EmploymentStatusMedical Leave 2.845e-02 1.649e-02 1.726 0.084461 .
## EmploymentStatusRetired 1.657e-02 1.844e-02 0.899 0.368815
## EmploymentStatusUnemployed -2.128e-02 1.332e-02 -1.598 0.110110
## Marital.StatusMarried 1.594e-02 7.302e-03 2.183 0.029047 *
## Marital.StatusSingle -1.857e-02 8.502e-03 -2.184 0.029024 *
## Months.Since.Last.Claim -4.798e-04 9.128e-04 -0.526 0.599186
## Number.of.Open.Complaints1 -1.711e-02 8.087e-03 -2.116 0.034370 *
## Number.of.Open.Complaints2 -5.746e-03 1.255e-02 -0.458 0.647106
## Number.of.Open.Complaints3 -5.915e-02 1.411e-02 -4.193 2.79e-05 ***
## Number.of.Open.Complaints4 -9.633e-02 2.004e-02 -4.808 1.56e-06 ***
## Number.of.Open.Complaints5 -1.247e-01 3.219e-02 -3.873 0.000108 ***
## Policy.TypePersonal Auto 1.210e-03 6.033e-03 0.201 0.841056
## Policy.TypeSpecial Auto 2.046e-02 1.311e-02 1.560 0.118692
## Renew.Offer.TypeOffer2 1.151e-02 6.203e-03 1.856 0.063530 .
## Renew.Offer.TypeOffer3 1.208e-02 7.464e-03 1.619 0.105513
## Renew.Offer.TypeOffer4 4.929e-03 8.714e-03 0.566 0.571647
## Sales.ChannelBranch 6.204e-03 6.177e-03 1.004 0.315178
## Sales.ChannelCall Center -3.693e-03 6.993e-03 -0.528 0.597483
## Sales.ChannelWeb -5.127e-03 7.738e-03 -0.663 0.507627
## Vehicle.SizeMedsize 1.463e-03 8.218e-03 0.178 0.858707
## Vehicle.SizeSmall 1.005e-03 9.567e-03 0.105 0.916345
## I(Monthly.Premium.Auto^2) -2.965e-05 1.758e-06 -16.866 < 2e-16 ***
## I(Total.Claim.Amount^2) -1.923e-08 1.574e-08 -1.222 0.221739
## I(Months.Since.Last.Claim^2) 3.202e-05 2.661e-05 1.203 0.228914
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2044 on 6798 degrees of freedom
## Multiple R-squared: 0.9027, Adjusted R-squared: 0.902
## F-statistic: 1237 on 51 and 6798 DF, p-value: < 2.2e-16
##
## studentized Breusch-Pagan test
##
## data: model_1.0
## BP = 2780.7, df = 51, p-value < 2.2e-16
The residuals are not distributed with equal variance throughout the plot.
Heteroskedasticity refers to situations where the variance of the residuals is unequal over a range of measured values. When running a regression analysis, heteroskedasticity results in an unequal scatter of the residuals (also known as the error term).
To formally test for heteroscedasticity, we perform a Breusch-Pagan test. The Breusch-Pagan test uses the following null and alternative hypotheses: - Null Hypothesis (H0): Homoscedasticity is present (the residuals are distributed with equal variance) - Alternative Hypothesis (HA): Heteroscedasticity is present (the residuals are not distributed with equal variance) Since the p-value from the test is less than 2.2e-16 we will reject the null hypothesis and conclude that heteroscedasticity is a problem in this model.
Since heteroscedasticity is present, we will perform weighted least squares by defining the weights in such a way that the observations with lower variance are given more weight.
##
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ Number.of.Policies +
## Monthly.Premium.Auto + Total.Claim.Amount + Vehicle.Class +
## State + Coverage + Education + Gender + EmploymentStatus +
## Marital.Status + Months.Since.Last.Claim + Number.of.Open.Complaints +
## Policy.Type + Renew.Offer.Type + Sales.Channel + Vehicle.Size +
## I(Monthly.Premium.Auto^2) + I(Total.Claim.Amount^2) + I(Months.Since.Last.Claim^2),
## data = train, weights = wt)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -28.396 -0.411 -0.011 0.304 35.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.710e+00 1.761e-02 381.100 < 2e-16 ***
## Number.of.Policies2 1.404e+00 9.071e-03 154.769 < 2e-16 ***
## Number.of.Policies3 6.921e-01 5.460e-03 126.758 < 2e-16 ***
## Number.of.Policies4 6.943e-01 9.061e-03 76.625 < 2e-16 ***
## Number.of.Policies5 6.884e-01 8.706e-03 79.073 < 2e-16 ***
## Number.of.Policies6 6.932e-01 9.693e-03 71.518 < 2e-16 ***
## Number.of.Policies7 6.895e-01 8.694e-03 79.316 < 2e-16 ***
## Number.of.Policies8 6.920e-01 9.501e-03 72.827 < 2e-16 ***
## Number.of.Policies9 6.925e-01 9.081e-03 76.257 < 2e-16 ***
## Monthly.Premium.Auto 1.961e-02 3.589e-04 54.647 < 2e-16 ***
## Total.Claim.Amount -1.594e-05 3.312e-06 -4.813 1.52e-06 ***
## Vehicle.ClassLuxury Car -5.441e-02 4.059e-02 -1.341 0.180084
## Vehicle.ClassLuxury SUV -4.936e-02 3.843e-02 -1.284 0.199131
## Vehicle.ClassSports Car -7.377e-02 6.255e-03 -11.794 < 2e-16 ***
## Vehicle.ClassSUV -5.227e-02 4.036e-03 -12.951 < 2e-16 ***
## Vehicle.ClassTwo-Door Car -1.103e-03 5.179e-04 -2.129 0.033281 *
## StateCalifornia 2.322e-03 6.367e-04 3.647 0.000268 ***
## StateNevada 7.233e-04 1.147e-03 0.631 0.528328
## StateOregon 2.688e-03 6.782e-04 3.963 7.47e-05 ***
## StateWashington -5.044e-03 8.940e-04 -5.642 1.75e-08 ***
## CoverageExtended -7.980e-03 2.711e-03 -2.944 0.003252 **
## CoveragePremium -3.711e-02 4.897e-03 -7.578 3.99e-14 ***
## EducationCollege -2.176e-03 6.536e-04 -3.328 0.000878 ***
## EducationDoctor 3.088e-02 2.233e-03 13.832 < 2e-16 ***
## EducationHigh School or Below 1.200e-02 5.454e-04 21.995 < 2e-16 ***
## EducationMaster 2.666e-02 9.843e-04 27.081 < 2e-16 ***
## GenderM -1.586e-02 5.242e-04 -30.259 < 2e-16 ***
## EmploymentStatusEmployed 4.756e-02 1.518e-03 31.326 < 2e-16 ***
## EmploymentStatusMedical Leave 4.421e-03 1.799e-03 2.458 0.014005 *
## EmploymentStatusRetired -3.135e-02 3.795e-03 -8.260 < 2e-16 ***
## EmploymentStatusUnemployed -5.387e-02 1.281e-03 -42.062 < 2e-16 ***
## Marital.StatusMarried 7.014e-03 7.130e-04 9.838 < 2e-16 ***
## Marital.StatusSingle -2.363e-02 8.296e-04 -28.479 < 2e-16 ***
## Months.Since.Last.Claim -4.059e-04 7.386e-05 -5.496 4.02e-08 ***
## Number.of.Open.Complaints1 -9.904e-03 1.336e-03 -7.411 1.40e-13 ***
## Number.of.Open.Complaints2 -2.461e-02 1.350e-03 -18.228 < 2e-16 ***
## Number.of.Open.Complaints3 -4.157e-02 2.645e-03 -15.718 < 2e-16 ***
## Number.of.Open.Complaints4 -7.428e-02 2.251e-03 -32.993 < 2e-16 ***
## Number.of.Open.Complaints5 -6.013e-02 3.853e-03 -15.608 < 2e-16 ***
## Policy.TypePersonal Auto -6.550e-03 4.701e-04 -13.933 < 2e-16 ***
## Policy.TypeSpecial Auto -9.119e-03 1.107e-03 -8.240 < 2e-16 ***
## Renew.Offer.TypeOffer2 -2.834e-03 7.534e-04 -3.761 0.000170 ***
## Renew.Offer.TypeOffer3 -8.123e-03 7.444e-04 -10.913 < 2e-16 ***
## Renew.Offer.TypeOffer4 -4.404e-03 6.483e-04 -6.794 1.18e-11 ***
## Sales.ChannelBranch 4.406e-03 6.045e-04 7.288 3.50e-13 ***
## Sales.ChannelCall Center 3.074e-03 7.391e-04 4.160 3.22e-05 ***
## Sales.ChannelWeb 1.331e-02 7.571e-04 17.579 < 2e-16 ***
## Vehicle.SizeMedsize 4.243e-03 5.561e-04 7.630 2.68e-14 ***
## Vehicle.SizeSmall -3.375e-03 7.668e-04 -4.402 1.09e-05 ***
## I(Monthly.Premium.Auto^2) -3.969e-05 1.796e-06 -22.102 < 2e-16 ***
## I(Total.Claim.Amount^2) -7.122e-11 4.364e-09 -0.016 0.986980
## I(Months.Since.Last.Claim^2) 1.769e-05 2.136e-06 8.281 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.663 on 6798 degrees of freedom
## Multiple R-squared: 0.9635, Adjusted R-squared: 0.9632
## F-statistic: 3514 on 51 and 6798 DF, p-value: < 2.2e-16
One of the key assumptions of regression is that the residuals are distributed with equal variance at each level of the predictor variable. This assumption is known as homoscedasticity. When this assumption is violated, we say that heteroscedasticity is present in the residuals. When this occurs, the results of the regression become unreliable.
We have handled this issue by using weighted polynomial regression, which places weights on the observations such that those with small error variance are given more weight since they contain more information compared to observations with larger error variance.
From summary the adj R-squared value is 0.9632
## [1] "RMSE 3884.68462974516"
## [1] "Adj R2 0.679653625844528"
## [1] "RMSE 3954.83290348664"
## [1] "Adj R2 0.664501142602053"
Adj R-squared for train is 0.679 which means that 67.9% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model. Adj R-squared for test is 0.664 which means that 66.4% of the variation in Customer.Lifetime.Value can be explained by the independent variables included in the model.
Fitted vs Residual graph
Residuals plots should be random in nature and there should not be any pattern in the graph. The average of the residual plot should be close to zero. From the above plot, we can see that the red trend line is almost at zero.
Normal Q-Q Plot
Q-Q plot shows whether the residuals are normally distributed. Ideally, the plot should be on the dotted line. If the Q-Q plot is not on the line then models need to be reworked to make the residual normal. In the above plot, we see that points in the middle lie on the line whereas points in the start and towards the end does not lie on the line.
Scale-Location
This shows how the residuals are spread and whether the residuals have an equal variance or not.
Residuals vs Leverage
The plot helps to find influential observations. Here we need to check for points that are outside the dashed line. A point outside the dashed line will be influential point and removal of that will affect the regression coefficients.
So to improve the Normal Q-Q Plot we perform the weighted polynomial regression. The results after are given below:
Here we can see most of the points lie on the dotted line for the Normal Q-Q plot.
##
## Call:
## lm(formula = log(Customer.Lifetime.Value) ~ Coverage + Education +
## EmploymentStatus + Gender + Income + Marital.Status + Monthly.Premium.Auto +
## Months.Since.Last.Claim + Number.of.Open.Complaints + Number.of.Policies +
## Vehicle.Class, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57550 -0.05414 -0.00398 0.03782 0.91718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.337e+00 2.933e-02 250.157 < 2e-16 ***
## CoverageExtended 7.481e-02 9.562e-03 7.823 5.93e-15 ***
## CoveragePremium 9.743e-02 2.005e-02 4.859 1.20e-06 ***
## EducationCollege 9.630e-04 6.569e-03 0.147 0.88346
## EducationDoctor 3.171e-02 1.397e-02 2.271 0.02320 *
## EducationHigh School or Below 1.957e-02 6.620e-03 2.956 0.00312 **
## EducationMaster 2.665e-02 9.959e-03 2.676 0.00747 **
## EmploymentStatusEmployed 5.524e-02 1.374e-02 4.021 5.86e-05 ***
## EmploymentStatusMedical Leave 3.356e-02 1.682e-02 1.995 0.04605 *
## EmploymentStatusRetired 6.524e-03 1.875e-02 0.348 0.72792
## EmploymentStatusUnemployed -1.336e-02 1.383e-02 -0.966 0.33392
## GenderM -2.435e-02 5.071e-03 -4.802 1.60e-06 ***
## Income 3.269e-07 1.448e-07 2.259 0.02394 *
## Marital.StatusMarried 1.953e-02 7.395e-03 2.641 0.00829 **
## Marital.StatusSingle -1.586e-02 8.607e-03 -1.842 0.06546 .
## Monthly.Premium.Auto 7.143e-03 3.750e-04 19.048 < 2e-16 ***
## Months.Since.Last.Claim 4.687e-04 2.503e-04 1.873 0.06117 .
## Number.of.Open.Complaints1 -1.770e-02 8.214e-03 -2.155 0.03121 *
## Number.of.Open.Complaints2 -4.200e-03 1.278e-02 -0.329 0.74238
## Number.of.Open.Complaints3 -6.027e-02 1.440e-02 -4.186 2.87e-05 ***
## Number.of.Open.Complaints4 -9.839e-02 2.042e-02 -4.819 1.48e-06 ***
## Number.of.Open.Complaints5 -1.292e-01 3.285e-02 -3.934 8.43e-05 ***
## Number.of.Policies2 1.401e+00 6.580e-03 212.957 < 2e-16 ***
## Number.of.Policies3 6.969e-01 8.278e-03 84.183 < 2e-16 ***
## Number.of.Policies4 6.977e-01 1.255e-02 55.595 < 2e-16 ***
## Number.of.Policies5 6.931e-01 1.260e-02 55.020 < 2e-16 ***
## Number.of.Policies6 6.952e-01 1.341e-02 51.827 < 2e-16 ***
## Number.of.Policies7 6.915e-01 1.240e-02 55.768 < 2e-16 ***
## Number.of.Policies8 6.991e-01 1.285e-02 54.388 < 2e-16 ***
## Number.of.Policies9 7.002e-01 1.263e-02 55.441 < 2e-16 ***
## Vehicle.ClassLuxury Car 2.267e-02 5.319e-02 0.426 0.66997
## Vehicle.ClassLuxury SUV 6.028e-02 5.278e-02 1.142 0.25347
## Vehicle.ClassSports Car 1.521e-01 2.033e-02 7.484 8.10e-14 ***
## Vehicle.ClassSUV 1.442e-01 1.758e-02 8.202 2.81e-16 ***
## Vehicle.ClassTwo-Door Car 1.360e-03 6.620e-03 0.205 0.83724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2087 on 6815 degrees of freedom
## Multiple R-squared: 0.8983, Adjusted R-squared: 0.8978
## F-statistic: 1770 on 34 and 6815 DF, p-value: < 2.2e-16
## GVIF Df GVIF^(1/(2*Df))
## Coverage 6.140049 2 1.574139
## Education 1.045128 4 1.005533
## EmploymentStatus 3.458399 4 1.167776
## Gender 1.010350 1 1.005162
## Income 3.034801 1 1.742068
## Marital.Status 1.257790 2 1.059015
## Monthly.Premium.Auto 25.830224 1 5.082344
## Months.Since.Last.Claim 1.006693 1 1.003341
## Number.of.Open.Complaints 1.031542 5 1.003110
## Number.of.Policies 1.043913 8 1.002690
## Vehicle.Class 20.892248 5 1.355185
We have tried bidirectional stepwise linear regression which does not improve the model.
## [1] "RMSE 4031.98888684047"
## [1] "Adj R2 0.654898433137791"
## [1] "RMSE 4138.61491980185"
## [1] "Adj R2 0.632595216780728"
Decision trees provide an effective method of Decision Making because they clearly lay out the problem so that all options can be challenged. They allow us to analyze fully the possible consequences of a decision and provides a framework to quantify the values of outcomes and the probabilities of achieving them.
This model does not improve the adj R-squared value compared to the weighted polynomial regression model.
We go with weighted polynomial regression model since it gives better value for adj R-squared and RMSE.This model explains most of the variation in target variable i.e. Customer.Lifetime.Value with respect to the independent variables.
The employment status of customers in all the 5 states does not differ significantly, however California and Oregon contribute to highest CLV values. Hence the company should try to up their game in the other states with lesser CLV values.
The response from the customers about the purchase of policy is negative in most cases. It is suggested that if the company invests in improving their customer service, offers provided, it will help in gaining positive response and hence acquire more customers.
Personal policy L3 is able to contribute to highest CLV values by attracting a large customer base. The company should try to advertise the other policies to get them recognized in the market.
Doctors and Master degree holders although few in number compared to other categories contribute to almost equal amounts of CLV. Hence, the company must draw the attention of these categories towards their policies. And more focus should be given to education as more and more doctors and post graduate holders will contribute to higher CLV.
Target customers based on their monthly premium auto and months since last claim.
Sales channels like call center and company websites do not contribute significantly to the CLV. So special emphasis can be given to improve these channels so that more and more customers do not have to just relay on agents to buy a policy.
Customers who have no income are unemployed but still contribute to high customer lifetime value. This fact cannot be justified from the data provided. Hence additional data on secondary sources of income must be provided to validate the statement.
The age of the customer can contribute to the CLV value. Hence this can be explored only if the data is available. Also we can find out if policies are sanctioned to genuine customers(legal age to drive is 16 in the given states of US)
The date of policy purchase is necessary to understand the loyalty of the customer as the number of years they are associated with the company can be calculated and its relation with CLV can be understood.
We divided the work equally amongst ourselves. Each of us had come up with our own EDA plots and tried many models. The models with better RMSE and Adj R-Squared value were selected
“Teamwork is the secret that makes common people achieve uncommon results :P”